[ENH] experimental design for model indexing framework #1583
Conversation
Codecov Report
```
@@ Coverage Diff @@
## main    #1583      +/-   ##
==========================================
+ Coverage   53.02%   53.10%   +0.07%
==========================================
  Files          36       51      +15
  Lines        4326     4514     +188
==========================================
+ Hits         2294     2397     +103
- Misses       2032     2117      +85
```
| "pkg_id": None, # object id contained, "__multiple" if multiple | ||
| "pkg_obj": "reference", # or "code" | ||
| "pkg_obj_type": None, # openml API type | ||
| "pkg_compression": "zlib", # compression |
This just selects the compression algorithm, right?
```python
return cls_str

cls_str = cls_str.encode("utf-8")
exec(f"import {compress_method}")
```
I guess this seems safe, because the tag is not user-supplied code. But it still makes me nervous to see this. We could just add a simple mapping layer to restrict what can be executed in here. Something like this:
```python
COMPRESSION_MODULES = {
    "zlib": "zlib",
    "gzip": "gzip",
    "bz2": "bz2",
    "lzma": "lzma",
}
```
```python
cls_str = cls_str.encode("utf-8")
exec(f"import {compress_method}")
return eval(f"{compress_method}.compress(cls_str)")
```
Couldn't we just use importlib for this? Something like this:
```python
module = importlib.import_module(compress_method)
cls_bytes = cls_str.encode("utf-8")
return module.compress(cls_bytes)
```
```python
return _safe_import(obj_loc, pkg_name=pkg_name)

if pkg_obj == "code":
    exec(self._obj)
```
Is this line for importing a module again? Can't we use something different than 'exec'?
```python
if pkg_obj == "code":
    exec(self._obj)

return obj
```
What is 'obj', and where is it defined? I do not see it in this scope.
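If the intent is that exec(self._obj) defines the object, one way to make that explicit is to execute the stored code in a dedicated namespace and look the object up by name. A minimal sketch, assuming the object's name is available from the tags under "pkg_id" (that key and the surrounding method are assumptions):
```python
if pkg_obj == "code":
    # Run the stored class definition in an isolated namespace instead of
    # relying on exec() mutating the enclosing scope.
    namespace: dict = {}
    exec(self._obj, namespace)
    # Retrieve the defined object explicitly, e.g. by the id stored in the tags.
    obj = namespace[self.get_tags()["pkg_id"]]

return obj
```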
PGijsbers left a comment
Hi Franz,
I was writing up a longer response at first, but in the process I got more unsure about the proposal, so I'd like to clarify a few things first.
Some of these remarks reference the design document you provided separately.
Ultimately, it will be best to provide a cohesive proposal on our Design and Feature Request discussion board, as I think this proposal affects the core infrastructure and/or responsibilities of OpenML, which means it will also need to be discussed with all core contributors (also outside of openml-python).
But I would be ok with first continuing the discussion here.
For the public, I will use some terms that Franz introduced there, most importantly:
- Estimator Classes (e.g., sklearn's RandomForestClassifier, autosklearn's AutoSklearnClassifier)
- Estimator Instances (e.g., sklearn's RandomForestClassifier(n_trees=42))
- Fitted Instances (e.g., RandomForestClassifier(n_trees=42).fit(X, y))
The most important ideas of this proposal are to:
- introduce a standardized interface to which each Estimator Class must adhere (not explicitly part of this PR). The interface would depend on the class of problems the Estimator supports.
- provide a repository to easily share these standardized estimators in a way that makes loading them in a uniform manner possible.
- replace the serialization of Estimator Classes into distinct OpenML concepts (e.g., separate schemas for parameters) with serialization into a single string, reducing the complexity of working with Flows.
Do you agree?
For completeness, these ideas are also part of the proposal, but they rely on the above ideas:
- maintaining a built-in estimator repository that is shipped with the package (possibly based on some leaderboard)
- the standardized interface can be used to significantly simplify "run model on task"-like functions, which currently program against specific implementations.
Creating Estimator Instances
The idea is to facilitate an easier, unified way to create Estimator Instances.
Users could then create estimator instances with a reference to the name, so instead of:

```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
```

they would write:

```python
import openml
rf = openml.get("RandomForestClassifier")
```

I reckon that the main advantage here is that we would know the latter object adheres to whatever estimator interface we define. If the imported estimator does not normally adhere to the shared interface, but the package manager saved a reference to some wrapped version of the estimator, it is now easier to use it out of the box. In a world where the packages already provide class definitions that adhere to the common interface (e.g., xgboost added their XGBClassifier sklearn interface - and let's assume that's good enough), the only benefit would be that if the user only knows the name, they would get an informative error message on what package to install (or openml could install it for them). In cases where the estimator is a pipeline with multiple components, multiple import statements may be saved.
Is that correct?
Separation of Estimator Class and Estimator Instance
I mention this because I think it was explicitly in one of the design documents, and I think it points to a misconception (either of my understanding of your proposal/terminology, or of your understanding of OpenML terminology). I believe this is the current function of Flows (Estimator Classes) and Setups (Estimator Instances).
The Flow defines the algorithm (e.g., sklearn 1.6 RandomForestClassifier) and its hyperparameters (n_trees) and defaults, but not any specific configuration.
When an experiment is run with e.g., RandomForestClassifier(n_trees=40, ...), a Flow is made for RandomForestClassifier, and the run result will point to that Flow alongside the Setup, which specifies the actual configured hyperparameters (n_trees=40, ...). The learned parameters (of a fitted instance) are not saved in principle (they can technically still be stored as a "run attachment", but we don't really support this in a systematic way).
Would you agree that this mapping is correct? Whether or not the specifications of Flows/Setups should be changed to be better/easier/... is a separate notion.
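For illustration, this mapping can be inspected through the current openml-python API; a minimal sketch, where the run id is a placeholder:
```python
import openml

# A run points both to a Flow (the Estimator Class) and a Setup (the Estimator Instance).
run = openml.runs.get_run(10)                  # placeholder run id
flow = openml.flows.get_flow(run.flow_id)      # algorithm, hyperparameter definitions, defaults
setup = openml.setups.get_setup(run.setup_id)  # the concrete hyperparameter values used

print(flow.name)
print({p.parameter_name: p.value for p in setup.parameters.values()})
```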
Serialization of Estimators
Currently, OpenML allows structured storage of information about the hyperparameters of each algorithm (Flow).
This includes each hyperparameter's type, default value, and recommended range (definition).
From my understanding, this proposal does away with this.
The only information about the Estimator that is retained is what is currently in the tags alongside the import path.
Is that correct?
The other part of the current flow structure is that it can be defined in terms of subflows (or components).
As such, on OpenML, Pipeline([Imputer, RandomForest]) currently results in three flows: the overarching flow and a separate subflow each for Imputer and RandomForest.
This proposal does away with this, and only maintains the top-level flow. Is that correct?
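To make the current subflow structure concrete, a minimal sketch using the existing sklearn extension (the pipeline steps are just placeholders):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

from openml.extensions.sklearn import SklearnExtension

pipe = Pipeline([("imputer", SimpleImputer()), ("rf", RandomForestClassifier())])

# Converting the pipeline yields an overarching flow plus one subflow per component.
flow = SklearnExtension().model_to_flow(pipe)
print(flow.name)              # the overarching Pipeline flow
print(list(flow.components))  # ["imputer", "rf"], each itself an OpenMLFlow
```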
Versioning
The deferral of versioning to PyPI isn't entirely clear to me, especially in conjunction with the ability to instantiate Estimator Instances from a string.
How would the following code example work?
```python
import openml
rf = openml.get("RandomForestClassifier(hyperparameter_a=23)")
```

Assume RandomForestClassifier comes from a package which is currently on version 1.8, and given that:
- parameter defaults may change from version to version (e.g., hyperparameter_b's default was updated in version 1.5)
- parameters may be introduced or removed from version to version (hyperparameter_a was removed in version 1.7)
- behavior may change from version to version (e.g., a bug was fixed in version 1.6)
- not all scikit-learn versions are available on all Python versions (assuming there is no support for local builds)
Unique Namespace
Flows currently have two unique identifiers: the integer identifier as well as the (name, external version) tuple.
Either can be used to uniquely identify the flow.
While this is surely not the original vision, a specific version of a package is effectively just a namespace.
Keeping the "external version" to a constant would allow you to create a lookup-by-name for flows (/flow/exists/{name}/{version} REST API endpoint even supports indexed lookups).
It might be interesting to explore that direction as it would allow the implementation (at least of that aspect of the proposal) without changes to server infrastructure.
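A minimal sketch of such a lookup-by-name through the existing client API, with a constant used as the external version (the constant's value is an assumption):
```python
import openml

NAMESPACE_VERSION = "model-index-v1"  # hypothetical constant used as "external version"

# flow_exists returns the flow id for a (name, external_version) pair, or False.
flow_id = openml.flows.flow_exists("RandomForestClassifier", NAMESPACE_VERSION)
if flow_id:
    flow = openml.flows.get_flow(flow_id)
```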
Thanks.
This PR is an experimental draft showcasing:
- XGBClassifier from xgboost, and
- AutoSklearnClassifier from auto-sklearn

Note: this is currently not using any backend - the idea is that the light-weight packaging layer can be easily serialized by the backend.
For a broader design of this and adjacent features, see the architecture document https://github.com/gc-os-ai/openml-project-dev/pull/14
Depends on:
- utils module to folder #1612

Usage pattern for end users
End users would interact with the layer as follows:
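A minimal sketch of the two showcased calls, assuming the openml.get entry point discussed in the review above (the exact name and signature are part of the experiment):
```python
import openml

# Either of these lines resolves a string from the object namespace
# through the packaging layer:
clf = openml.get("XGBClassifier")             # backed by xgboost
automl = openml.get("AutoSklearnClassifier")  # backed by auto-sklearn
```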
Either line will:
- if the dependencies (xgboost, auto-sklearn, scikit-learn) are present, directly import the class from the package.

The strings live in a unique namespace for objects - usually, for widely known packages (but not necessarily), they will correspond to class names.
This could be further extended to:
- a deps utility that, for a namespace ID, provides required dependencies - the tags already are inspectable here
- ... pip freeze.

Versioning inside openml is consciously avoided in this design, as versioning already happens in the dependencies, which are tracked by the tag system.
The logic internally works through:
- a materialize interface that obtains the class from the package layer

Usage patterns for estimator contributors
Currently, an extender would interact with this by manually adding a class in openml.models.classifiers; in the simplest case these are thin wrappers pointing to a third-party location:
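A minimal, hypothetical sketch of such a thin wrapper (the tag keys follow the snippet quoted in the review above; the import path, class name, and exact tag values are assumptions):
```python
from openml.models._base import _BasePkg  # assumed location of the new base class

class XGBClassifierPkg(_BasePkg):
    """Pointer record for xgboost's XGBClassifier."""

    _tags = {
        "pkg_name": "xgboost",        # providing package (assumed key)
        "pkg_id": "XGBClassifier",    # object id contained
        "pkg_obj": "reference",       # a pointer rather than inlined code
        "pkg_obj_type": None,         # openml API type
        "pkg_compression": "zlib",    # compression used by serialize
    }

    # import location the reference points to; resolved at materialize time
    _obj = "xgboost.XGBClassifier"
```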
But this class could also rely on a full python definition of the class itself, with _obj pointing to a python file, or the method _materialize implementing explicit python code.
This can also be automated further in the context of a database backend:
- a publish method that automatically extracts tags etc., similar to OpenMLFlow.publish
- a publish method applying to an entire package, crawling it, and extracting pointer records to all sklearn estimators

Intended backend interaction pattern
A database backend would be inserted as follows.
for publish / populate
- publish creates a dynamic package class inheriting from the new _BasePkg
- call serialize on it to obtain a string
- call get_tags to obtain a dict of packaging metadata

for query
- invert serialize by zlib decompress and exec to obtain a dynamic python package class
- call materialize from that class
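A minimal sketch of that round trip, assuming the serialize / get_tags / materialize methods described above and a plain dict standing in for the backend store (the helper names and the lookup key are assumptions):
```python
import zlib

def publish_pkg(pkg, store: dict) -> None:
    """publish / populate: store the serialized class and its packaging metadata."""
    tags = pkg.get_tags()              # dict of packaging metadata
    store[tags["pkg_id"]] = {
        "payload": pkg.serialize(),    # zlib-compressed class definition (bytes)
        "tags": tags,
    }

def query_pkg(pkg_id: str, store: dict):
    """query: invert serialize, rebuild the dynamic package class, and materialize."""
    record = store[pkg_id]
    source = zlib.decompress(record["payload"]).decode("utf-8")
    namespace: dict = {}
    exec(source, namespace)                        # recreate the dynamic package class
    pkg_cls = namespace[record["tags"]["pkg_id"]]  # assumed lookup key for the class
    return pkg_cls().materialize()                 # obtain the estimator class from it
```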